Ranking Gene-Drug Relationships in Biomedical Literature Using Latent Dirichlet Allocation
Abstract
Drug responses vary greatly among individuals because of human genetic variation; the study of this relationship is known as pharmacogenomics (PGx). Much PGx knowledge is embedded in the biomedical literature, and there is growing interest in developing text mining approaches to extract it. In this paper, we present a study that ranks candidate gene-drug relations using a Latent Dirichlet Allocation (LDA) model. Our approach consists of three steps: 1) recognize gene and drug entities in MEDLINE abstracts; 2) extract candidate gene-drug pairs based on different levels of co-occurrence, namely abstract level, sentence level, and phrase level; and 3) rank candidate gene-drug pairs using several methods, including term frequency, the Chi-square test, Mutual Information (MI), a previously reported Kullback-Leibler (KL) distance based on topics derived from LDA (LDA-KL), and a newly defined probabilistic KL distance based on LDA (LDA-PKL). We systematically evaluated these methods against a gold standard data set of gene-drug relations derived from PharmGKB. Our results showed that the proposed LDA-PKL method achieved a higher Mean Average Precision (MAP) than any of the other methods, suggesting that it is promising for ranking and detecting PGx relations.
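The co-occurrence and evaluation steps named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes gene and drug entities have already been recognized per sentence, it uses pointwise mutual information as one common per-pair form of MI (the paper's exact formulation is not given here), and all function names and the input format are hypothetical. The LDA-based distances are not sketched because their definitions are not stated in the abstract.

```python
import math
from collections import Counter
from itertools import product


def cooccurrence_counts(sentences):
    """Count sentence-level co-occurrences of (gene, drug) pairs.

    `sentences` is an iterable of (genes, drugs) tuples, each holding the
    entity names recognized in one sentence (hypothetical input format).
    """
    pair_counts, gene_counts, drug_counts = Counter(), Counter(), Counter()
    n_sentences = 0
    for genes, drugs in sentences:
        genes, drugs = set(genes), set(drugs)
        n_sentences += 1
        gene_counts.update(genes)
        drug_counts.update(drugs)
        pair_counts.update(product(genes, drugs))
    return pair_counts, gene_counts, drug_counts, n_sentences


def rank_by_pmi(pair_counts, gene_counts, drug_counts, n_sentences):
    """Rank candidate pairs by pointwise mutual information, highest first."""
    scores = {
        (g, d): math.log((c * n_sentences) / (gene_counts[g] * drug_counts[d]))
        for (g, d), c in pair_counts.items()
    }
    # Ties are broken alphabetically so the ordering is deterministic.
    return sorted(scores, key=lambda pair: (-scores[pair], pair))


def average_precision(ranked, relevant):
    """Average precision of one ranked list against a set of relevant items."""
    hits, total = 0, 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(rankings, gold):
    """Mean of average precision over queries (assumed: one ranking per query)."""
    aps = [average_precision(rankings[q], gold.get(q, set())) for q in rankings]
    return sum(aps) / len(aps) if aps else 0.0


# Toy data for illustration only; not taken from the paper or from PharmGKB.
toy_sentences = [
    ({"CYP2D6"}, {"codeine"}),
    ({"CYP2D6"}, {"codeine"}),
    ({"VKORC1"}, {"warfarin"}),
    ({"CYP2D6", "VKORC1"}, {"warfarin"}),
]
toy_ranking = rank_by_pmi(*cooccurrence_counts(toy_sentences))
toy_gold = {("CYP2D6", "codeine"), ("VKORC1", "warfarin")}
toy_ap = average_precision(toy_ranking, toy_gold)
```

In this toy example the incidental pair (CYP2D6, warfarin) receives a negative score and ranks last, so both gold pairs sit at the top of the list. Swapping `rank_by_pmi` for a raw-frequency or Chi-square scorer would only change how `scores` is computed; the evaluation functions stay the same.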
Similar resources
Correlated Topic Model for Web Services Ranking
With the increasing number of published Web services providing similar functionalities, it’s very tedious for a service consumer to make decision to select the appropriate one according to her/his needs. In this paper, we explore several probabilistic topic models: Probabilistic Latent Semantic Analysis (PLSA), Latent Dirichlet Allocation (LDA) and Correlated Topic Model (CTM) to extract latent...
Modeling Protein-Protein Interactions in Biomedical Abstracts with Latent Dirichlet Allocation
A major goal in biomedical text processing is the automatic extraction of protein interaction information from scientific articles or abstracts. We approach this task with a topic-based generative model. Under the model, sentences in biomedical abstracts can be generated by either an ’interaction’ topic if they contain or discuss interacting proteins or a ’background’ topic otherwise. This stru...
Aspect-Specific Ranking of Product Reviews Using Topic Modeling
We examine the problem of ranking different aspects of a product through examination of its customer reviews. For instance, a restaurant review may contain distinct and possibly differing opinions on the food, decor, service, and price. We present a ranking system that uses Latent Dirichlet Allocation (LDA) and a database of opinion-oriented words to predict the aspect-specific sentiment of ind...
Clustering Images Using the Latent Dirichlet Allocation Model
Clustering, in simple words, is grouping similar data items together. In the text domain, clustering is largely popular and fairly successful. In this work, we try and apply clustering methods that are used in the text domain, to the image domain. Two major challenges in this approach are image representation and vocabulary definition. We apply the bag-of-words model to images using image segme...
Bridging the gap: Incorporating a semantic similarity measure for effectively mapping PubMed queries to documents
The main approach of traditional information retrieval (IR) is to examine how many words from a query appear in a document. A drawback of this approach, however, is that it may fail to detect relevant documents where no or only few words from a query are found. The semantic analysis methods such as LSA (latent semantic analysis) and LDA (latent Dirichlet allocation) have been proposed to addres...
Journal title:
- Pacific Symposium on Biocomputing
Publication date: 2012